The accuracy of the tree model generated from data will depend on the data sample (i.e. the data source) used for induction. The tree is induced from the Training Table via the Field Usage tab. The induction algorithm of MinedTree objects is binary: it creates a two-way branch at every split in the tree. At each stage of tree building, the attribute to split on is selected according to the information content of each attribute with respect to classifying the outcome groups, and the most informative attribute is chosen at every branching point. For discrete attributes the value groups are divided between the two branches so as to maximise the information content of the attribute. For numeric and date/time attributes the two-way split is based on a threshold derived to maximise the information content of the attribute. When the outcome is numeric or date/time, the standard deviation of the data filtering to each branch is used as the basis for selecting the best attribute and the best threshold.
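The split selection described above can be sketched as follows. This is a minimal Python illustration of the general idea only, not the MinedTree implementation: the entropy-based gain measure, the standard-deviation score and all function names are assumptions made for the example.

```python
import math
import statistics
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of discrete outcome values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_score_discrete(left, right):
    """Information gained by a two-way split of a discrete outcome (higher is better)."""
    n = len(left) + len(right)
    before = entropy(left + right)
    after = (len(left) * entropy(left) + len(right) * entropy(right)) / n
    return before - after

def split_score_numeric(left, right):
    """Reduction in outcome standard deviation achieved by the split,
    used when the outcome is numeric or date/time (higher is better)."""
    n = len(left) + len(right)
    before = statistics.pstdev(left + right)
    after = (len(left) * statistics.pstdev(left) + len(right) * statistics.pstdev(right)) / n
    return before - after

def best_threshold(values, outcomes, score):
    """Choose the two-way split threshold on a numeric attribute that
    maximises the given score function."""
    pairs = sorted(zip(values, outcomes))
    best = (None, float("-inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                      # no split boundary between equal values
        t = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [o for v, o in pairs if v <= t]
        right = [o for v, o in pairs if v > t]
        s = score(left, right)
        if s > best[1]:
            best = (t, s)
    return best
```

An induction loop built on this sketch would call best_threshold (or the equivalent value-group partitioning for discrete attributes) for every candidate attribute at a node and branch on the attribute with the highest score.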
One of the parameters which must be specified before the induction process commences is the Minimum Examples in a branch. This figure gives the induction algorithm a criterion for stopping the creation of new branches from any given point in the tree whenever the number of data samples filtering to that point falls below the limit. This limit provides a defence against noise in the data: in effect it only allows branches to be developed from an acceptable number of records. Normally this figure is set depending on the total number of records (table rows) and your estimated level of noise in the data. As a guideline, set the Minimum Examples in a branch to 2% of the number of records. If you are analysing what might be considered 'clean' data where every record counts, then you should set this figure to a small number, or even to 1. For numeric outcomes you can also set a forward-pruning parameter: the F-test cut-off value. More generally, you can set parameters to stop branching beyond a certain level of significance (the F-test for numeric outcomes and the Chi-square significance level for discrete outcomes).
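A rough sketch of how these two stopping rules might be combined for a discrete outcome is shown below. The function, its parameter names and defaults, and the use of scipy's chi2_contingency are assumptions for illustration, not the product's API.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def should_stop(left_outcomes, right_outcomes, min_examples=20, significance=0.05):
    """Decide whether to stop branching at a node (discrete outcome).

    Combines the two stopping rules described above: a minimum number of
    examples per branch, and a significance test on the proposed split.
    """
    # Rule 1: each branch must receive at least `min_examples` records.
    if len(left_outcomes) < min_examples or len(right_outcomes) < min_examples:
        return True

    # Rule 2: the split must separate the outcome classes significantly
    # better than chance (chi-square test on the branch-by-class table).
    classes = sorted(set(left_outcomes) | set(right_outcomes))
    table = [[Counter(branch)[c] for c in classes]
             for branch in (left_outcomes, right_outcomes)]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value > significance
```

For a numeric outcome the structure would be the same, with an F-test on the variance of the two branches taking the place of the chi-square test.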